13 research outputs found

    Code similarity and clone search in large-scale source code data

    Software development benefits tremendously from the Internet: online code corpora enable instant sharing of source code, developer guides, and documentation. Nowadays, duplicated code (i.e., code clones) exists not only within or across software projects but also between online code repositories and websites. We call these "online code clones." Like classic code clones between software systems, they can lead to license violations, bug propagation, and reuse of outdated code. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them have become outdated or violate their original license and are potentially harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also find that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique that uses multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse.
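
    Siamese itself is not reproduced here, but the core idea of token-based clone search can be pictured as follows: represent each code fragment as a set of token n-grams and rank corpus fragments by overlap with the query. The sketch below is a minimal illustration under assumed snippets and an assumed similarity threshold, not the actual Siamese technique or its multiple-representation indexing.

        # Minimal sketch of token-based clone search via n-gram overlap.
        # The corpus snippets and the similarity threshold are hypothetical.
        import re

        def tokens(code):
            """Split source code into identifier and single-character tokens."""
            return re.findall(r"[A-Za-z_]\w*|\S", code)

        def ngrams(toks, n=3):
            """Build the set of token n-grams used as the search representation."""
            return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

        def similarity(query, candidate, n=3):
            """Jaccard similarity between the n-gram sets of two code fragments."""
            a, b = ngrams(tokens(query), n), ngrams(tokens(candidate), n)
            return len(a & b) / len(a | b) if a and b else 0.0

        def clone_search(query, corpus, threshold=0.2):
            """Return corpus fragments ranked by similarity to the query."""
            hits = [(similarity(query, frag), path) for path, frag in corpus.items()]
            return sorted([h for h in hits if h[0] >= threshold], reverse=True)

        corpus = {  # hypothetical indexed snippets
            "Sum.java": "int sum(int[] xs) { int s = 0; for (int x : xs) s += x; return s; }",
            "Log.java": "void log(String m) { System.out.println(m); }",
        }
        query = "int total(int[] v) { int t = 0; for (int x : v) t += x; return t; }"
        print(clone_search(query, corpus))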

    A Taxonomy for Mining and Classifying Privacy Requirements in Issue Reports

    Digital and physical footprints are a trail of user activities collected over the use of software applications and systems. As software becomes ubiquitous, protecting user privacy has become challenging. With increasing user privacy awareness and the advent of privacy regulations and policies, there is an emerging need to implement software systems that enhance the protection of personal data processing. However, existing privacy regulations and policies only provide high-level principles, which makes it difficult for software engineers to design and implement privacy-aware systems. In this paper, we develop a taxonomy that provides a comprehensive set of privacy requirements based on two well-established and widely adopted privacy regulations and frameworks: the General Data Protection Regulation (GDPR) and ISO/IEC 29100. These requirements are refined to a level that is implementable and easy for software engineers to understand, thus supporting them in complying with existing regulations and standards. We have also performed a study of how two large open-source software projects (Google Chrome and Moodle) address the privacy requirements in our taxonomy by mining their issue reports. The paper discusses how the collected issues were classified and presents the findings and insights generated from our study.
    Comment: Submitted to IEEE Transactions on Software Engineering on 23 December 202
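
    The abstract describes the taxonomy and the classification of issues only at a high level. As a rough illustration, mapping issue reports to privacy requirement categories could start from simple keyword matching against taxonomy entries; the categories and keywords below are hypothetical examples, not the actual GDPR/ISO/IEC 29100 taxonomy.

        # Minimal sketch of keyword-based mapping from issue reports to privacy
        # requirement categories. Categories and keywords are hypothetical.
        TAXONOMY = {
            "user consent": ["consent", "opt-in", "opt out"],
            "data erasure": ["delete my data", "erasure", "right to be forgotten"],
            "data minimisation": ["collect only", "unnecessary data", "minimise data"],
        }

        def classify_issue(issue_text):
            """Return the taxonomy categories whose keywords appear in the issue."""
            text = issue_text.lower()
            return [req for req, keywords in TAXONOMY.items()
                    if any(k in text for k in keywords)]

        print(classify_issue("Users should be able to opt out of telemetry and delete my data"))
        # -> ['user consent', 'data erasure']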

    Mining the Characteristics of Jupyter Notebooks in Data Science Projects

    Nowadays, numerous industries have exceptional demand for data science skills, such as data analysis, data mining, and machine learning. The computational notebook (e.g., Jupyter Notebook) is a well-known data science tool adopted in practice. Kaggle and GitHub are two platforms that data science communities use for knowledge sharing, skill practising, and collaboration. While tutorials and guidelines for novice data scientists are available on both platforms, only a small number of Jupyter Notebooks receive high numbers of votes from the community. A high-voted notebook is considered well-documented, easy to understand, and aligned with the best data science and software engineering practices. In this research, we aim to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and popular Jupyter Notebooks for data science projects on GitHub. We plan to mine and analyse the Jupyter Notebooks on both platforms. We will perform exploratory analytics, data visualization, and feature importance analysis to understand the overall structure of these notebooks and to identify common patterns and best-practice features that separate low-voted from high-voted notebooks. Upon completion of this research, the discovered insights can be applied as training guidelines for aspiring data scientists and machine learning practitioners looking to progress from a novice-ranked Jupyter Notebook on Kaggle to a deployable project on GitHub.
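
    As a rough sketch of the planned analysis, one could extract simple structural features from each notebook and rank them by importance for predicting whether a notebook is high-voted. The feature names, toy labels, and the random-forest choice below are assumptions for illustration; the study's actual feature set and models may differ.

        # Minimal sketch: structural notebook features + feature importance.
        # Feature names, toy data, and the model choice are assumptions.
        import json
        from sklearn.ensemble import RandomForestClassifier

        def notebook_features(path):
            """Count markdown cells, code cells, and code source lines in a .ipynb file."""
            with open(path) as f:
                cells = json.load(f).get("cells", [])
            markdown = sum(1 for c in cells if c["cell_type"] == "markdown")
            code = sum(1 for c in cells if c["cell_type"] == "code")
            loc = sum(len(c["source"]) for c in cells if c["cell_type"] == "code")
            return [markdown, code, loc]

        # X: one feature row per notebook; y: 1 = high-voted, 0 = low-voted (toy data)
        X = [[12, 30, 400], [1, 25, 900], [8, 20, 300], [0, 40, 1200]]
        y = [1, 0, 1, 0]
        model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        for name, score in zip(["markdown_cells", "code_cells", "code_lines"],
                               model.feature_importances_):
            print(name, round(score, 3))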

    BigCloneBench Considered Harmful for Machine Learning


    A taxonomy for mining and classifying privacy requirements in issue reports

    Context: Digital and physical trails of user activities are collected over the use of software applications and systems. As software becomes ubiquitous, protecting user privacy has become challenging. With increasing user privacy awareness and the advent of privacy regulations and policies, there is an emerging need to implement software systems that enhance the protection of personal data processing. However, existing data protection and privacy regulations provide key principles only at a high level, making it difficult for software engineers to design and implement privacy-aware systems. Objective: In this paper, we develop a taxonomy that provides a comprehensive set of privacy requirements based on four well-established personal data protection regulations and privacy frameworks: the General Data Protection Regulation (GDPR), ISO/IEC 29100, the Thailand Personal Data Protection Act (Thailand PDPA), and the Asia-Pacific Economic Cooperation (APEC) privacy framework. Methods: These requirements are extracted, refined, and classified (using the goal-based requirements analysis method) to a level at which they can be mapped to issue reports. We have also performed a study of how two large open-source software projects (Google Chrome and Moodle) address the privacy requirements in our taxonomy by mining their issue reports. Results: The paper discusses how the collected issues were classified and presents the findings and insights generated from our study. Conclusion: Mining and classifying privacy requirements in issue reports can help organisations become aware of their state of compliance by identifying privacy requirements that have not been addressed in their software projects. The taxonomy can also be traced back to the regulations, standards, and frameworks with which the software projects have not complied, based on the identified privacy requirements.
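
    The traceability mentioned in the conclusion can be pictured as a mapping from each taxonomy requirement back to its source regulations, so that unaddressed requirements point to the provisions a project may not yet comply with. The mapping below is an illustrative subset with a hypothetical project result, not the paper's full taxonomy.

        # Minimal sketch of tracing privacy requirements back to regulations.
        # The requirement-to-source mapping is an illustrative subset only.
        TRACEABILITY = {
            "data erasure": ["GDPR Art. 17", "Thailand PDPA"],
            "user consent": ["GDPR Art. 7", "ISO/IEC 29100", "APEC privacy framework"],
            "data minimisation": ["GDPR Art. 5(1)(c)", "ISO/IEC 29100"],
        }

        def unaddressed_sources(addressed_requirements):
            """List the sources behind requirements a project has not yet addressed."""
            missing = set(TRACEABILITY) - set(addressed_requirements)
            return {req: TRACEABILITY[req] for req in sorted(missing)}

        # e.g. a project whose classified issues only cover consent so far:
        print(unaddressed_sources(["user consent"]))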

    Automatically recommending components for issue reports using deep learning

    Today’s software development is typically driven by incremental changes made to software to implement new functionality, fix a bug, or improve performance and security. Each change request is often described as an issue. Recent studies suggest that the set of components (e.g., software modules) relevant to the resolution of an issue is one of the most important pieces of information provided with the issue, and one that software engineers often rely on. However, assigning an issue to the correct component(s) is challenging, especially for large-scale projects with up to hundreds of components. In this paper, we propose a predictive model that learns from historical issue reports and recommends the most relevant components for new issues. Our model uses Long Short-Term Memory (LSTM), a deep learning technique, to automatically learn semantic features representing an issue report, and combines them with traditional textual-similarity features. An extensive evaluation on 142,025 issues from 11 large projects shows that our approach outperforms one common baseline, two state-of-the-art techniques, and six alternative techniques, with an improvement in predictive performance of 16.70%–66.31% on average across all projects.
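
    The deep-learning part of such a recommender can be sketched as an LSTM that reads the tokenised issue text and outputs one relevance score per component. The vocabulary size, layer sizes, and multi-label sigmoid output below are assumptions for illustration, and the combination with textual-similarity features described in the paper is omitted.

        # Minimal sketch of an LSTM-based component recommender (Keras).
        # Sizes and the multi-label output are assumptions for illustration.
        from tensorflow.keras import layers, models

        VOCAB_SIZE, N_COMPONENTS = 20000, 50  # assumed vocabulary and component counts

        model = models.Sequential([
            layers.Embedding(VOCAB_SIZE, 128),                 # word embeddings for issue tokens
            layers.LSTM(64),                                   # learned semantic features of the issue
            layers.Dense(N_COMPONENTS, activation="sigmoid"),  # one relevance score per component
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
        # model.fit(padded_issue_token_ids, component_label_matrix, epochs=5)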

    Automatic Classifying Self-Admitted Technical Debt Using N-Gram IDF

    2019 26th Asia-Pacific Software Engineering Conference (APSEC), Putrajaya, Malaysia. Technical Debt (TD) introduces a quality problem and increases maintenance cost, since it may require improvements in the future. Several studies show that it is possible to automatically detect TD from source code comments that developers intentionally created, so-called self-admitted technical debt (SATD). Those studies proposed using a binary classification technique to predict whether a comment indicates SATD. However, SATD has different types (e.g., design SATD and requirement SATD). In this paper, we therefore propose an approach using N-gram Inverse Document Frequency (IDF) and employ a multi-class classification technique to build a model that can identify different types of SATD. In an empirical evaluation on 10 open-source projects, our approach outperforms alternative methods (e.g., using BOW and TF-IDF) and improves the prediction performance over the baseline benchmark by 33%. This work has been supported by JSPS KAKENHI (Grant Numbers 16H05857 and 17H00731).
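
    As a rough illustration of multi-class SATD classification, the sketch below uses scikit-learn's word n-gram TF-IDF as a stand-in for the paper's N-gram IDF weighting, with toy comments and labels in place of the 10-project dataset.

        # Minimal sketch of multi-class SATD classification. Word n-gram TF-IDF
        # stands in for N-gram IDF; the comments and labels are toy examples.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        comments = [
            "TODO: refactor this class, it does too much",              # design debt
            "FIXME: locale handling required by the spec is missing",   # requirement debt
            "hack to work around the slow query for now",               # design debt
            "feature X still not implemented as required",              # requirement debt
        ]
        labels = ["design", "requirement", "design", "requirement"]

        classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                                   LogisticRegression(max_iter=1000))
        classifier.fit(comments, labels)
        print(classifier.predict(["TODO: this workaround needs a proper redesign"]))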